library(topGO)
library(knitr)   # to use kable()
library(limma)   # to use vennDiagram()
library(ggplot2)
TAG <- params$TAG
VAR <- params$VAR
ANNOTATION <- list(genes = '../2019-07-26/genes/annotation.txt',
                   isoforms = '../2019-07-26/transcripts/annotation.txt')
TAGDIR <- paste0('../2020-01-08/', TAG, '/')

Introduction

This is the enrichment analysis for isoforms regulated by regime. Because I quantified the association of expression with regime in two different ways, this document has two sections. Section Variance reports the enrichment analysis performed with isoforms ordered by the proportion of expression variance explained by regime. Note that some isoforms with low variance may still be highly associated with regime, even if the fold change between levels of this factor is low. Section Differential expression uses an ordering of isoforms based on the significance of the differential expression between levels of regime, which does depend on fold change.

Reading the data

Functional annotation is in 2019-07-26. I will also upload two lists of isoforms, with either proportion of variance explained by regime or p-value of differential expression test.

tagVariance <- read.table(paste0(TAGDIR, VAR, '_variance.txt'))
tagPValue   <- read.table(paste0(TAGDIR, VAR, '_pvalue.txt'))
annotation  <- read.table(ANNOTATION[[TAG]], col.names = c('tagname', 'goterms'))

To initialize the topGOdata object, I will need the gene list as a named numeric vector, where the names are the isoforms identifiers and the numeric values, either the portion of variance explained by regime or the p-values of the differential expression test. The structure() function adds attributes to an object.

Variance <- structure(tagVariance[,1], names = row.names(tagVariance))
PValues  <- structure(tagPValue[,1],   names = row.names(tagPValue))
rm(tagVariance, tagPValue)

There are 33037 isoforms scored with a variance portion and a differential expression p-value. It should actually be the exact same isoforms. The annotation data frame has more than one GO term for every tag, separated by |. I need a named list.

head(annotation)
##          tagname
## 1 TCONS_00000002
## 2 TCONS_00000010
## 3 TCONS_00000011
## 4 TCONS_00000016
## 5 TCONS_00000017
## 6 TCONS_00000018
##                                                             goterms
## 1                                  GO:0008417|GO:0016020|GO:0006486
## 2                                  GO:0043130|GO:0005515|GO:0043161
## 3                       GO:0005840|GO:0015935|GO:0003735|GO:0006412
## 4 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 5 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 6 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
allgenes2GO <- strsplit(as.character(annotation$goterms), "|", fixed = TRUE)
names(allgenes2GO) <- annotation$tagname
rm(annotation)

There are 19735 isoforms with GO annotations. But the differential expression analysis includes many more isoforms. In order to include the not-annotated isoforms in the enrichment analysis, to see if annotated or not annotated isoforms are more or less often differentially expressed, I should attribute a GO term to them. According to [http://geneontology.org/docs/faq/] nowadays we express lack of annotation by annotating to the root nodes, i.e. GO:0008150 biological_process, GO:0003674 molecular_function, and GO:0005575 cellular_component.

for (tag in unique(c(names(PValues), names(Variance)))) {
   if (! tag %in% names(allgenes2GO)) {
      allgenes2GO <- append(allgenes2GO,
         structure(list(c("GO:0008150", "GO:0003674", "GO:0005575")), names = tag))
   }
}

Using differential expression p-values

Building the topGO object

Creation of a topGO dataset is documented in section 4 of topGO’s the user manual: https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf. I need to use the new command, and fill up the slots. The annot object must be a function that compiles “a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term.” There are several options, that you can check using help(annFUN.gene2GO), for example. The annFUN.gene2GO requires a gene2GO argument, which is the list of gene-to-GO terms I made before. The geneSelectionFun object is a function that selects the interesting (most significant) genes. It is required to perform Fisher’s exact test. The nodeSize is used to prune the GO hierarchy from the terms which have less than n annotated genes.

I generate three datasets, to analyse each of the three ontologies.

selection <- function(allScores) {return(allScores < 0.01)}
GOdata.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), numGenes),
   Num_GO_terms = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 26235 1582
MF 30580 791
CC 21616 390

Running the tests

There are more than one way to test for enrichment. Something that took me a while to understand is that not only there are different test statistics (Fisher’s exact test, Kolmogorov-Smirnov, and others) but also different algorithms: classic, elim, weight… The algorithms are ways to deal with the dependence structure among GO terms due to topology. Some algorithms are compatible with all statistics implemented in topGO. But the weight and the parentchild algorithms are only applicable to Fisher’s exact test. I am not interested in the classic algorithm, which treats GO nodes as independent, and therefore produces an excess of significant results. I will not use the Fisher’s exact test, because it dependes on an arbitrary threshold of significance on non-adjusted p-values.

BP.elim     <- runTest(GOdata.BP, algorithm = "elim",     statistic = "ks")
BP.weight01 <- runTest(GOdata.BP, algorithm = "weight01", statistic = "ks")
BP.lea      <- runTest(GOdata.BP, algorithm = "lea",      statistic = "ks")
MF.elim     <- runTest(GOdata.MF, algorithm = "elim",     statistic = "ks")
MF.weight01 <- runTest(GOdata.MF, algorithm = "weight01", statistic = "ks")
MF.lea      <- runTest(GOdata.MF, algorithm = "lea",      statistic = "ks")
CC.elim     <- runTest(GOdata.CC, algorithm = "elim",     statistic = "ks")
CC.weight01 <- runTest(GOdata.CC, algorithm = "weight01", statistic = "ks")
CC.lea      <- runTest(GOdata.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) length(score(X))),
   Significant = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1582 19
BP weight01 1582 18
BP lea 1582 24
MF elim 791 20
MF weight01 791 19
MF lea 791 29
CC elim 390 13
CC weight01 390 11
CC lea 390 21
rm(ResultsSummary)

Results

The topGO package only looks for GO terms that are overrepresented among the top of the gene list (genes with lowest score, assumed to be a \(p\) value; see 2020-06-30). Thus, underrepresented GO terms are ignored, even though it could be interesting to know if among differentially expressed genes there are fewer genes annotated with certain terms than expected by chance. The way to identify those terms would be to reverse the scores. In this case, running the analysis with \(1 - p\) values would be to search for terms that are overrepresented among non-differentially expressed genes. Not worth pursuing now.

Note that, despite my initial confusion (see previous commits), it is not true that significant terms with fewer significant genes than expected are significantly underrepresented terms. It is just possible that a term is significant because of the overall distribution of scores of genes annotated with that term, even if none of those genes (or just fewer than expected) are significant. This is because of the use of Kolmogorov-Smirnov test, instead of the Fisher’s exact test. Note that even though no hard threshold is necessary in the K-S test, GenTable() produces a summary table with number of annotated and significant genes for every significant GO term. That number of significant term is determined by the selection() function passed to the topGO object, only for display purposes unless using Fisher’s exact test.

Biological process

orderedTerms <- names(sort(score(BP.weight01)))
significant.weight01 <- score(BP.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BP.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BP.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.pvalue.sigTerms <- sigTerms

BP.all <- GenTable(GOdata.BP, elim=BP.elim, weight01=BP.weight01, lea=BP.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
write.table(BP.all, file=paste(TAG, VAR, 'BPsummary.tsv', sep='/'),
            quote=FALSE, sep='\t', row.names=FALSE)

kable(BP.all,
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0007186 G protein-coupled receptor signaling pat… 302 17 23.25 1 4.2e-06 2.7e-06 4.2e-06
GO:0006508 proteolysis 822 89 63.29 2 1.5e-05 3.1e-06 5.8e-06
GO:0060271 cilium assembly 73 3 5.62 5 0.00022 1.3e-05 5.5e-07
GO:0006355 regulation of transcription, DNA-templat… 621 44 47.81 3 3.2e-05 4.9e-05 3.2e-05
GO:0060294 cilium movement involved in cell motilit… 7 1 0.54 4 0.00020 0.00020 0.00020
GO:0007154 cell communication 1169 69 90.01 830 0.70082 0.00021 0.73799
GO:0000079 regulation of cyclin-dependent protein s… 9 4 0.69 6 0.00030 0.00030 0.00030
GO:0006813 potassium ion transport 103 12 7.93 7 0.00044 0.00047 0.00044
GO:0005991 trehalose metabolic process 22 8 1.69 9 0.00107 0.00107 1.3e-05
GO:0044271 cellular nitrogen compound biosynthetic … 1365 99 105.10 1483 0.99078 0.00252 0.98536
GO:0035082 axoneme assembly 18 1 1.39 8 0.00047 0.00273 0.00047
GO:0006520 cellular amino acid metabolic process 208 18 16.02 480 0.41000 0.00309 0.59686
GO:0005992 trehalose biosynthetic process 7 4 0.54 11 0.00315 0.00315 0.00315
GO:0006030 chitin metabolic process 103 18 7.93 14 0.00656 0.00601 0.00656
GO:0005975 carbohydrate metabolic process 360 45 27.72 123 0.07878 0.00763 0.06199
GO:0006979 response to oxidative stress 40 12 3.08 15 0.00790 0.00790 0.00790
GO:0006367 transcription initiation from RNA polyme… 20 7 1.54 17 0.00828 0.00828 0.00828
GO:0031146 SCF-dependent proteasomal ubiquitin-depe… 5 0 0.38 18 0.00893 0.00893 0.00893
GO:0042246 tissue regeneration 6 2 0.46 20 0.01006 0.01006 0.01006
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term = Term(GOTERM[sigTerms]),
                 Definition = Definition(GOTERM[sigTerms]),
                 PValue=score(BP.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with',
                      VAR, 'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0007186 G protein-coupled receptor signaling pathway A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). 0.0000027
GO:0006508 proteolysis The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. 0.0000031
GO:0060271 cilium assembly The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. 0.0000132
GO:0006355 regulation of transcription, DNA-templated Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. 0.0000488
GO:0060294 cilium movement involved in cell motility Movement of cilia mediated by motor proteins that contributes to the movement of a cell. 0.0001995
GO:0000079 regulation of cyclin-dependent protein serine/threonine kinase activity Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. 0.0003019
GO:0006813 potassium ion transport The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0004678
GO:0005991 trehalose metabolic process The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0010651
GO:0035082 axoneme assembly The assembly and organization of an axoneme, the bundle of microtubules and associated proteins that forms the core of cilia (also called flagella) in eukaryotic cells and is responsible for their movements. 0.0027293
GO:0005992 trehalose biosynthetic process The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0031481
GO:0006030 chitin metabolic process The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0060109
GO:0006979 response to oxidative stress Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of oxidative stress, a state often resulting from exposure to high levels of reactive oxygen species, e.g. superoxide anions, hydrogen peroxide (H2O2), and hydroxyl radicals. 0.0079030
GO:0006367 transcription initiation from RNA polymerase II promoter Any process involved in the assembly of the RNA polymerase II preinitiation complex (PIC) at an RNA polymerase II promoter region of a DNA template, resulting in the subsequent synthesis of RNA from that promoter. The initiation phase includes PIC assembly and the formation of the first few bonds in the RNA chain, including abortive initiation, which occurs when the first few nucleotides are repeatedly synthesized and then released. Promoter clearance, or release, is the transition between the initiation and elongation phases of transcription. 0.0082808
GO:0031146 SCF-dependent proteasomal ubiquitin-dependent protein catabolic process The chemical reactions and pathways resulting in the breakdown of a protein or peptide by hydrolysis of its peptide bonds, initiated by the covalent attachment of ubiquitin, with ubiquitin-protein ligation catalyzed by an SCF (Skp1/Cul1/F-box protein) complex, and mediated by the proteasome. 0.0089260

I think the GO graph is useful to see the relationship among the significant terms. But too large graphs are impossible to read. I don’t know how to split the graph in meaningful subgraphs.

showSigOfNodes(GOdata.BP, score(BP.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 166 
## Number of Edges = 300 
## 
## $complete.dag
## [1] "A graph with 166 nodes."

This is just a example of the most significant GO term:

showGroupDensity(GOdata.BP, orderedTerms[1], rm.one=FALSE)

showGroupDensity(GOdata.BP, orderedTerms[2], rm.one=FALSE)

showGroupDensity(GOdata.BP, orderedTerms[3], rm.one=FALSE)

Molecular function

orderedTerms <- names(sort(score(MF.weight01)))
significant.weight01 <- score(MF.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MF.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MF.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.pvalue.sigTerms <- sigTerms

MF.all <- GenTable(GOdata.MF, elim=MF.elim, weight01=MF.weight01, lea=MF.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
write.table(MF.all, file=paste(TAG, VAR, 'MFsummary.tsv', sep='/'),
            quote=FALSE, sep='\t', row.names=FALSE)

kable(MF.all,
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0005509 calcium ion binding 638 61 48.86 1 3.0e-16 3.0e-16 3.0e-16
GO:0043565 sequence-specific DNA binding 221 11 16.93 3 0.00035 5.7e-05 0.00035
GO:0004930 G protein-coupled receptor activity 251 16 19.22 2 9.9e-06 5.8e-05 1.3e-06
GO:0008417 fucosyltransferase activity 42 1 3.22 4 0.00041 0.00041 0.00041
GO:0003774 motor activity 277 12 21.21 156 0.13853 0.00044 0.13853
GO:0030248 cellulose binding 8 5 0.61 5 0.00054 0.00054 0.00054
GO:0004555 alpha,alpha-trehalase activity 15 4 1.15 6 0.00113 0.00113 0.00113
GO:0008134 transcription factor binding 33 3 2.53 20 0.00828 0.00118 0.00828
GO:0005200 structural constituent of cytoskeleton 23 3 1.76 7 0.00121 0.00121 0.00121
GO:0004181 metallocarboxypeptidase activity 39 5 2.99 8 0.00178 0.00178 0.00178
GO:0016462 pyrophosphatase activity 952 68 72.91 210 0.23026 0.00297 0.68247
GO:0004198 calcium-dependent cysteine-type endopept… 62 7 4.75 10 0.00301 0.00301 0.00301
GO:0004252 serine-type endopeptidase activity 143 24 10.95 11 0.00329 0.00329 0.00329
GO:0008061 chitin binding 112 16 8.58 14 0.00393 0.00393 0.00393
GO:0004601 peroxidase activity 59 12 4.52 93 0.06946 0.00397 0.06946
GO:0004611 phosphoenolpyruvate carboxykinase activi… 7 1 0.54 15 0.00440 0.00440 0.00440
GO:0004983 neuropeptide Y receptor activity 11 0 0.84 18 0.00670 0.00670 0.00670
GO:0005267 potassium channel activity 80 10 6.13 12 0.00353 0.00684 0.00353
GO:0008047 enzyme activator activity 135 10 10.34 109 0.08050 0.00834 0.08050
GO:0030246 carbohydrate binding 83 17 6.36 28 0.01315 0.01064 0.04417
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MF.weight01)[sigTerms]),
      caption = paste('Molecular function terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular function terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005509 calcium ion binding Interacting selectively and non-covalently with calcium ions (Ca2+). 0.0000000
GO:0043565 sequence-specific DNA binding Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. 0.0000573
GO:0004930 G protein-coupled receptor activity Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. 0.0000582
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0004103
GO:0030248 cellulose binding Interacting selectively and non-covalently with cellulose. 0.0005392
GO:0004555 alpha,alpha-trehalase activity Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. 0.0011331
GO:0008134 transcription factor binding Interacting selectively and non-covalently with a transcription factor, any protein required to initiate or regulate transcription. 0.0011813
GO:0005200 structural constituent of cytoskeleton The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. 0.0012128
GO:0004181 metallocarboxypeptidase activity Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. 0.0017817
GO:0004198 calcium-dependent cysteine-type endopeptidase activity Catalysis of the hydrolysis of nonterminal peptide bonds in a polypeptide chain by a mechanism using a cysteine residue at the enzyme active center, and requiring the presence of calcium. 0.0030053
GO:0004252 serine-type endopeptidase activity Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). 0.0032931
GO:0008061 chitin binding Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0039312
GO:0004611 phosphoenolpyruvate carboxykinase activity Catalysis of the reaction: source of phosphate + oxaloacetate = phosphoenolpyruvate + CO2 + other reaction products. 0.0043958
GO:0004983 neuropeptide Y receptor activity Combining with neuropeptide Y to initiate a change in cell activity. 0.0067048
GO:0005267 potassium channel activity Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. 0.0068424
showSigOfNodes(GOdata.MF, score(MF.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 86 
## Number of Edges = 102 
## 
## $complete.dag
## [1] "A graph with 86 nodes."
showGroupDensity(GOdata.MF, orderedTerms[1], rm.one=FALSE)

showGroupDensity(GOdata.MF, orderedTerms[2], rm.one=FALSE)

showGroupDensity(GOdata.MF, orderedTerms[3], rm.one=FALSE)

Cellular component

orderedTerms <- names(sort(score(CC.weight01)))
significant.weight01 <- score(CC.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CC.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CC.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.pvalue.sigTerms <- sigTerms

CC.all <- GenTable(GOdata.CC, elim=CC.elim, weight01=CC.weight01, lea=CC.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

write.table(CC.all, file=paste(TAG, VAR, 'CCsummary.tsv', sep='/'),
            quote=FALSE, sep='\t', row.names=FALSE)

kable(CC.all,
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0005576 extracellular region 197 38 15.03 1 1.7e-06 3.3e-05 1.7e-06
GO:0005743 mitochondrial inner membrane 44 12 3.36 2 4.8e-05 0.00032 4.8e-05
GO:0016459 myosin complex 57 3 4.35 3 0.00036 0.00036 0.00036
GO:0005929 cilium 52 3 3.97 6 0.00071 0.00082 0.00071
GO:0016021 integral component of membrane 1739 155 132.66 4 0.00046 0.00236 8.1e-05
GO:0005886 plasma membrane 324 34 24.72 7 0.00082 0.00445 0.00097
GO:0090575 RNA polymerase II transcription factor c… 47 11 3.59 31 0.02538 0.00503 0.02538
GO:0098800 inner mitochondrial membrane protein com… 27 7 2.06 26 0.02146 0.00526 0.02146
GO:0042555 MCM complex 12 5 0.92 11 0.00624 0.00624 0.00624
GO:0099512 supramolecular fiber 52 6 3.97 9 0.00344 0.00766 0.00344
GO:0031225 anchored component of membrane 8 4 0.61 13 0.00813 0.00813 0.00813
GO:0008076 voltage-gated potassium channel complex 30 5 2.29 14 0.01150 0.01150 0.01150
GO:0016020 membrane 3059 260 233.36 30 0.02483 0.01168 2.1e-07
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CC.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005576 extracellular region The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. 0.0000329
GO:0005743 mitochondrial inner membrane The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. 0.0003155
GO:0016459 myosin complex A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. 0.0003574
GO:0005929 cilium A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. 0.0008211
GO:0016021 integral component of membrane The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. 0.0023586
GO:0005886 plasma membrane The membrane surrounding a cell that separates the cell from its external environment. It consists of a phospholipid bilayer and associated proteins. 0.0044522
GO:0042555 MCM complex A hexameric protein complex required for the initiation and regulation of DNA replication. 0.0062379
GO:0099512 supramolecular fiber A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. 0.0076615
GO:0031225 anchored component of membrane The component of a membrane consisting of the gene products that are tethered to the membrane only by a covalently attached anchor, such as a lipid group that is embedded in the membrane. Gene products with peptide sequences that are embedded in the membrane are excluded from this grouping. 0.0081261
showSigOfNodes(GOdata.CC, score(CC.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 63 
## Number of Edges = 110 
## 
## $complete.dag
## [1] "A graph with 63 nodes."
showGroupDensity(GOdata.CC, orderedTerms[1], rm.one=FALSE)

showGroupDensity(GOdata.CC, orderedTerms[2], rm.one=FALSE)

showGroupDensity(GOdata.CC, orderedTerms[3], rm.one=FALSE)

Using the portion of variance explained by regime

Building the topGO object

I need to generate the topGO objects again, using the alternative gene ordering, based on the proportion of expression-level variance explained by regime. I miss a way to inform the topGOdata object that the score in this case is reversed, with respect to \(p\)-values: the higher it is, the more differentially expressed the gene is. To make sure that GO terms are tested in the same way than when using p-values, I will just reverse the proportion of variance explained by regime to its complement. Taking this into account, the subset of interesting genes (selection() function) must be defined as the lowest 10% scores, which are the 10% genes with largest portion of variance explained by regime.

selection <- function(allScores) {return(allScores <= quantile(allScores, probs = 0.10))}
GOdataVar.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

library(knitr)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), numGenes),
   Num_GO_terms = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 26235 1582
MF 30580 791
CC 21616 390

Running the tests

BPvar.elim     <- runTest(GOdataVar.BP, algorithm = "elim",     statistic = "ks")
BPvar.weight01 <- runTest(GOdataVar.BP, algorithm = "weight01", statistic = "ks")
BPvar.lea      <- runTest(GOdataVar.BP, algorithm = "lea",      statistic = "ks")
MFvar.elim     <- runTest(GOdataVar.MF, algorithm = "elim",     statistic = "ks")
MFvar.weight01 <- runTest(GOdataVar.MF, algorithm = "weight01", statistic = "ks")
MFvar.lea      <- runTest(GOdataVar.MF, algorithm = "lea",      statistic = "ks")
CCvar.elim     <- runTest(GOdataVar.CC, algorithm = "elim",     statistic = "ks")
CCvar.weight01 <- runTest(GOdataVar.CC, algorithm = "weight01", statistic = "ks")
CCvar.lea      <- runTest(GOdataVar.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) length(score(X))),
   Significant = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1582 22
BP weight01 1582 23
BP lea 1582 29
MF elim 791 29
MF weight01 791 27
MF lea 791 39
CC elim 390 11
CC weight01 390 12
CC lea 390 19
rm(ResultsSummary)

Results

Biological process

orderedTerms <- names(sort(score(BPvar.weight01)))
significant.weight01 <- score(BPvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BPvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BPvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.variance.sigTerms <- sigTerms

BPvar.all <- GenTable(GOdataVar.BP, elim=BPvar.elim, weight01=BPvar.weight01, lea=BPvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(BPvar.all,
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0007186 G protein-coupled receptor signaling pat… 302 44 29.72 1 9.7e-07 3.4e-08 9.7e-07
GO:0006508 proteolysis 822 121 80.90 5 0.00019 1.5e-07 1.2e-05
GO:0060271 cilium assembly 73 11 7.18 3 7.2e-05 5.7e-05 5.0e-07
GO:0006355 regulation of transcription, DNA-templat… 621 65 61.12 2 6.8e-05 0.00019 6.8e-05
GO:0006813 potassium ion transport 103 14 10.14 4 7.5e-05 0.00046 7.5e-05
GO:0005992 trehalose biosynthetic process 7 4 0.69 7 0.00064 0.00064 0.00064
GO:0060294 cilium movement involved in cell motilit… 7 4 0.69 8 0.00067 0.00067 0.00067
GO:0006030 chitin metabolic process 103 19 10.14 6 0.00034 0.00071 0.00034
GO:0006520 cellular amino acid metabolic process 208 18 20.47 719 0.66735 0.00075 0.65245
GO:0007160 cell-matrix adhesion 24 8 2.36 9 0.00075 0.00075 0.00075
GO:0006801 superoxide metabolic process 20 1 1.97 11 0.00107 0.00107 0.00107
GO:0000079 regulation of cyclin-dependent protein s… 9 5 0.89 12 0.00112 0.00112 0.00112
GO:0005975 carbohydrate metabolic process 360 56 35.43 52 0.02497 0.00167 0.01764
GO:0006814 sodium ion transport 108 13 10.63 13 0.00167 0.00167 0.00167
GO:0007154 cell communication 1169 107 115.05 799 0.73112 0.00185 0.74893
GO:0005991 trehalose metabolic process 22 9 2.17 14 0.00188 0.00188 5.4e-06
GO:0044271 cellular nitrogen compound biosynthetic … 1365 104 134.34 1469 0.99399 0.00189 0.94722
GO:0006821 chloride transport 15 1 1.48 10 0.00081 0.00212 0.00081
GO:0006367 transcription initiation from RNA polyme… 20 8 1.97 15 0.00272 0.00272 0.00272
GO:0035082 axoneme assembly 18 4 1.77 16 0.00303 0.00327 0.00303
GO:0071805 potassium ion transmembrane transport 20 3 1.97 28 0.01748 0.00528 0.01748
GO:0046173 polyol biosynthetic process 8 3 0.79 218 0.14874 0.00618 0.14874
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(BPvar.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0007186 G protein-coupled receptor signaling pathway A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). 0.0000000
GO:0006508 proteolysis The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. 0.0000001
GO:0060271 cilium assembly The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. 0.0000567
GO:0006355 regulation of transcription, DNA-templated Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. 0.0001890
GO:0006813 potassium ion transport The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0004553
GO:0005992 trehalose biosynthetic process The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0006394
GO:0060294 cilium movement involved in cell motility Movement of cilia mediated by motor proteins that contributes to the movement of a cell. 0.0006687
GO:0006030 chitin metabolic process The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0007143
GO:0007160 cell-matrix adhesion The binding of a cell to the extracellular matrix via adhesion molecules. 0.0007536
GO:0006801 superoxide metabolic process The chemical reactions and pathways involving superoxide, the superoxide anion O2- (superoxide free radical), or any compound containing this species. 0.0010662
GO:0000079 regulation of cyclin-dependent protein serine/threonine kinase activity Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. 0.0011196
GO:0006814 sodium ion transport The directed movement of sodium ions (Na+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0016691
GO:0005991 trehalose metabolic process The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. 0.0018778
GO:0006821 chloride transport The directed movement of chloride into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. 0.0021196
GO:0006367 transcription initiation from RNA polymerase II promoter Any process involved in the assembly of the RNA polymerase II preinitiation complex (PIC) at an RNA polymerase II promoter region of a DNA template, resulting in the subsequent synthesis of RNA from that promoter. The initiation phase includes PIC assembly and the formation of the first few bonds in the RNA chain, including abortive initiation, which occurs when the first few nucleotides are repeatedly synthesized and then released. Promoter clearance, or release, is the transition between the initiation and elongation phases of transcription. 0.0027247
GO:0035082 axoneme assembly The assembly and organization of an axoneme, the bundle of microtubules and associated proteins that forms the core of cilia (also called flagella) in eukaryotic cells and is responsible for their movements. 0.0032734
GO:0042246 tissue regeneration The regrowth of lost or destroyed tissues. 0.0084031
showSigOfNodes(GOdataVar.BP, score(BPvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 174 
## Number of Edges = 306 
## 
## $complete.dag
## [1] "A graph with 174 nodes."

Below I plot variance portion, but for the termo found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.BP, orderedTerms[1], rm.one=FALSE)

showGroupDensity(GOdataVar.BP, orderedTerms[2], rm.one=FALSE)

showGroupDensity(GOdataVar.BP, orderedTerms[3], rm.one=FALSE)

Molecular function

orderedTerms <- names(sort(score(MFvar.weight01)))
significant.weight01 <- score(MFvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MFvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MFvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.variance.sigTerms <- sigTerms

MFvar.all <- GenTable(GOdataVar.MF, elim=MFvar.elim, weight01=MFvar.weight01, lea=MFvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(MFvar.all,
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0005509 calcium ion binding 638 116 64.38 1 4.6e-18 4.6e-18 4.6e-18
GO:0004930 G protein-coupled receptor activity 251 39 25.33 2 5.2e-07 3.1e-07 1.2e-08
GO:0043565 sequence-specific DNA binding 221 23 22.30 3 7.6e-05 3.5e-06 7.6e-05
GO:0003774 motor activity 277 50 27.95 16 0.00499 9.0e-05 0.00499
GO:0004252 serine-type endopeptidase activity 143 31 14.43 4 0.00035 0.00035 0.00035
GO:0008061 chitin binding 112 18 11.30 5 0.00054 0.00054 0.00054
GO:0030248 cellulose binding 8 2 0.81 6 0.00085 0.00085 0.00085
GO:0004181 metallocarboxypeptidase activity 39 10 3.94 7 0.00091 0.00091 0.00091
GO:0008417 fucosyltransferase activity 42 3 4.24 10 0.00149 0.00149 0.00149
GO:0004555 alpha,alpha-trehalase activity 15 5 1.51 11 0.00220 0.00220 0.00220
GO:0016462 pyrophosphatase activity 952 109 96.07 491 0.76140 0.00244 0.85685
GO:0005267 potassium channel activity 80 13 8.07 8 0.00102 0.00282 0.00102
GO:0016651 oxidoreductase activity, acting on NAD(P… 47 5 4.74 116 0.09384 0.00431 0.09384
GO:0005200 structural constituent of cytoskeleton 23 10 2.32 13 0.00467 0.00467 0.00467
GO:0004553 hydrolase activity, hydrolyzing O-glycos… 169 33 17.05 40 0.02098 0.00480 0.02098
GO:0005247 voltage-gated chloride channel activity 7 0 0.71 17 0.00501 0.00501 0.00501
GO:0004040 amidase activity 5 4 0.50 18 0.00552 0.00552 0.00552
GO:0004198 calcium-dependent cysteine-type endopept… 62 13 6.26 19 0.00594 0.00594 0.00594
GO:0031409 pigment binding 11 4 1.11 20 0.00602 0.00602 0.00602
GO:0030246 carbohydrate binding 83 17 8.38 46 0.02647 0.00627 0.04503
GO:0005507 copper ion binding 29 8 2.93 21 0.00646 0.00646 0.00646
GO:0140098 catalytic activity, acting on RNA 278 12 28.05 544 0.83730 0.00709 0.99630
GO:0005272 sodium channel activity 83 11 8.38 22 0.00722 0.00722 0.00722
GO:0051015 actin filament binding 47 12 4.74 24 0.00740 0.00740 0.00740
GO:0005201 extracellular matrix structural constitu… 7 3 0.71 27 0.00901 0.00901 0.00901
GO:0008083 growth factor activity 8 4 0.81 28 0.00918 0.00918 0.00918
GO:0008199 ferric iron binding 15 4 1.51 29 0.00992 0.00992 0.00992
GO:0004601 peroxidase activity 59 16 5.95 23 0.00732 0.01186 0.00732
GO:0005249 voltage-gated potassium channel activity 54 8 5.45 31 0.01347 0.01252 0.01347
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MFvar.weight01)[sigTerms]),
      caption = paste('Molecular functions terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular functions terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005509 calcium ion binding Interacting selectively and non-covalently with calcium ions (Ca2+). 0.0000000
GO:0004930 G protein-coupled receptor activity Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. 0.0000003
GO:0043565 sequence-specific DNA binding Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. 0.0000035
GO:0003774 motor activity Catalysis of the generation of force resulting either in movement along a microfilament or microtubule, or in torque resulting in membrane scission, coupled to the hydrolysis of a nucleoside triphosphate. 0.0000895
GO:0004252 serine-type endopeptidase activity Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). 0.0003473
GO:0008061 chitin binding Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0005355
GO:0030248 cellulose binding Interacting selectively and non-covalently with cellulose. 0.0008517
GO:0004181 metallocarboxypeptidase activity Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. 0.0009053
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0014917
GO:0004555 alpha,alpha-trehalase activity Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. 0.0021975
GO:0005267 potassium channel activity Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. 0.0028229
GO:0005200 structural constituent of cytoskeleton The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. 0.0046674
GO:0005247 voltage-gated chloride channel activity Enables the transmembrane transfer of a chloride ion by a voltage-gated channel. A voltage-gated channel is a channel whose open state is dependent on the voltage across the membrane in which it is embedded. 0.0050061
GO:0004040 amidase activity Catalysis of the reaction: a monocarboxylic acid amide + H2O = a monocarboxylate + NH3. 0.0055211
GO:0004198 calcium-dependent cysteine-type endopeptidase activity Catalysis of the hydrolysis of nonterminal peptide bonds in a polypeptide chain by a mechanism using a cysteine residue at the enzyme active center, and requiring the presence of calcium. 0.0059422
GO:0031409 pigment binding Interacting selectively and non-covalently with a pigment, any general or particular coloring matter in living organisms, e.g. melanin. 0.0060164
GO:0005507 copper ion binding Interacting selectively and non-covalently with copper (Cu) ions. 0.0064552
GO:0005272 sodium channel activity Enables the facilitated diffusion of a sodium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. 0.0072212
GO:0051015 actin filament binding Interacting selectively and non-covalently with an actin filament, also known as F-actin, a helical filamentous polymer of globular G-actin subunits. 0.0073983
GO:0005201 extracellular matrix structural constituent The action of a molecule that contributes to the structural integrity of the extracellular matrix. 0.0090108
GO:0008083 growth factor activity The function that stimulates a cell to grow or proliferate. Most growth factors have other actions besides the induction of cell growth or proliferation. 0.0091849
GO:0008199 ferric iron binding Interacting selectively and non-covalently with ferric iron, Fe(III). 0.0099227
showSigOfNodes(GOdataVar.MF, score(MFvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 109 
## Number of Edges = 137 
## 
## $complete.dag
## [1] "A graph with 109 nodes."

I plot variance portion, but for the term found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.MF, orderedTerms[1], rm.one=FALSE)

showGroupDensity(GOdataVar.MF, orderedTerms[2], rm.one=FALSE)

showGroupDensity(GOdataVar.MF, orderedTerms[3], rm.one=FALSE)

Cellular component

orderedTerms <- names(sort(score(CCvar.weight01)))
significant.weight01 <- score(CCvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CCvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CCvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
if (length(sigTerms) == 0) {
   sigTerms <- orderedTerms[significant.lea & significant.elim]
}
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.variance.sigTerms <- sigTerms

CCvar.all <- GenTable(GOdataVar.CC, elim=CCvar.elim, weight01=CCvar.weight01, lea=CCvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=max(sum(significant.elim), 10))

kable(CCvar.all,
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
GO:0005576 extracellular region 197 39 19.49 1 2.0e-06 1.9e-05 2.0e-06
GO:0005743 mitochondrial inner membrane 44 11 4.35 3 5.5e-05 2.8e-05 5.5e-05
GO:0016459 myosin complex 57 21 5.64 4 7.0e-05 7.0e-05 7.0e-05
GO:0005929 cilium 52 15 5.15 5 0.0014 9.5e-05 0.00017
GO:0016021 integral component of membrane 1739 215 172.08 2 3.4e-05 0.00024 4.7e-06
GO:0016020 membrane 3059 337 302.70 11 0.0089 0.00231 2.1e-07
GO:0090575 RNA polymerase II transcription factor c… 47 6 4.65 83 0.2041 0.00294 0.20411
GO:0005891 voltage-gated calcium channel complex 11 4 1.09 6 0.0031 0.00310 0.00310
GO:0005886 plasma membrane 324 40 32.06 26 0.0296 0.00417 0.00160
GO:0099512 supramolecular fiber 52 14 5.15 9 0.0076 0.00469 0.00764
GO:0008076 voltage-gated potassium channel complex 30 6 2.97 8 0.0055 0.00552 0.00552
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CCvar.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with regime according to all 3 algorithms
Term Definition PValue
GO:0005576 extracellular region The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. 0.0000195
GO:0005743 mitochondrial inner membrane The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. 0.0000282
GO:0016459 myosin complex A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. 0.0000700
GO:0005929 cilium A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. 0.0000950
GO:0016021 integral component of membrane The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. 0.0002372
GO:0016020 membrane A lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it. 0.0023057
GO:0005891 voltage-gated calcium channel complex A protein complex that forms a transmembrane channel through which calcium ions may pass in response to changes in membrane potential. 0.0031048
GO:0099512 supramolecular fiber A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. 0.0046859
GO:0008076 voltage-gated potassium channel complex A protein complex that forms a transmembrane channel through which potassium ions may cross a cell membrane in response to changes in membrane potential. 0.0055216
GO:0005581 collagen trimer A protein complex consisting of three collagen chains assembled into a left-handed triple helix. These trimers typically assemble into higher order structures. 0.0086510
showSigOfNodes(GOdataVar.CC, score(CCvar.weight01),
               firstSigNodes = sum(significant.weight01),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 61 
## Number of Edges = 103 
## 
## $complete.dag
## [1] "A graph with 61 nodes."

For comparison, I plot the distribution of variance portion for the CC term found most significant when using p-values.

showGroupDensity(GOdataVar.CC, orderedTerms[1], rm.one=FALSE)

showGroupDensity(GOdataVar.CC, orderedTerms[2], rm.one=FALSE)

showGroupDensity(GOdataVar.CC, orderedTerms[3], rm.one=FALSE)

Comparison between the two ordering of genes.

Biological process

allTerms <- usedGO(GOdata.BP)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% BP.pvalue.sigTerms,
                             Variance = allTerms %in% BP.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(BP.weight01))[allTerms],
                         Variance = rank(score(BPvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth(method='lm') + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of BP terms by significance')

Molecular function

allTerms <- usedGO(GOdata.MF)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% MF.pvalue.sigTerms,
                             Variance = allTerms %in% MF.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(MF.weight01))[allTerms],
                         Variance = rank(score(MFvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth(methdo='lm') + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of MF terms by significance')
## Warning: Ignoring unknown parameters: methdo

Cellular component

allTerms <- usedGO(GOdata.CC)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% CC.pvalue.sigTerms,
                             Variance = allTerms %in% CC.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(CC.weight01))[allTerms],
                         Variance = rank(score(CCvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth(method='lm') + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of CC terms by significance')

Session info

I save the main variables to be able to load them in a new R session later.

save(allgenes2GO,
     GOdata.BP, BP.elim, BP.weight01, BP.lea, BP.pvalue.sigTerms,
     GOdata.MF, MF.elim, MF.weight01, MF.lea, MF.pvalue.sigTerms,
     GOdata.CC, CC.elim, CC.weight01, CC.lea, CC.pvalue.sigTerms,
     GOdataVar.BP, BPvar.elim, BPvar.weight01, BPvar.lea, BP.variance.sigTerms,
     GOdataVar.MF, MFvar.elim, MFvar.weight01, MFvar.lea, MF.variance.sigTerms,
     GOdataVar.CC, CCvar.elim, CCvar.weight01, CCvar.lea, CC.variance.sigTerms,
     file = paste('Enrichment', TAG, VAR, 'RData', sep='.'))
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=ca_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ca_ES.UTF-8        LC_COLLATE=ca_ES.UTF-8    
##  [5] LC_MONETARY=ca_ES.UTF-8    LC_MESSAGES=ca_ES.UTF-8   
##  [7] LC_PAPER=ca_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.30.0     ggplot2_3.3.0        limma_3.42.2        
##  [4] knitr_1.28           topGO_2.38.1         SparseM_1.78        
##  [7] GO.db_3.10.0         AnnotationDbi_1.48.0 IRanges_2.20.2      
## [10] S4Vectors_0.24.3     Biobase_2.46.0       graph_1.64.0        
## [13] BiocGenerics_0.32.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.0.0   xfun_0.12          purrr_0.3.3        splines_3.6.3     
##  [5] lattice_0.20-41    colorspace_1.4-1   vctrs_0.2.4        htmltools_0.4.0   
##  [9] yaml_2.2.1         mgcv_1.8-33        blob_1.2.1         rlang_0.4.5       
## [13] pillar_1.4.3       glue_1.4.0         withr_2.1.2        DBI_1.1.0         
## [17] bit64_0.9-7        matrixStats_0.55.0 lifecycle_0.2.0    stringr_1.4.0     
## [21] munsell_0.5.0      gtable_0.3.0       memoise_1.1.0      evaluate_0.14     
## [25] labeling_0.3       fansi_0.4.1        highr_0.8          Rcpp_1.0.4        
## [29] scales_1.1.0       farver_2.0.3       bit_1.1-15.2       digest_0.6.25     
## [33] stringi_1.4.6      dplyr_0.8.5        cli_2.0.2          tools_3.6.3       
## [37] magrittr_1.5       RSQLite_2.2.0      tibble_3.0.0       crayon_1.3.4      
## [41] pkgconfig_2.0.3    Matrix_1.2-18      ellipsis_0.3.0     assertthat_0.2.1  
## [45] rmarkdown_2.1      R6_2.4.1           nlme_3.1-149       compiler_3.6.3